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ABSTRACT 


A  two-dimensional  Bernard-Rayleigh  convection  problem  is 
solved  numerically  as  an  initial-and  boundary-value  problem  on  a 
network  of  (M+2)  x  (N+2)  grid  points.   The  purpose  of  this  paper 
is  not  to  discuss  the  result  of  numerical  integrations,  but  to 
investigate  the  efficiency  of  ILLIAC  IV  for  this  class  of  hydros 
dynamic  problems. 

Two  cases  are  considered:   The  first  case  is  a  program 
for  N<62  and  the  second  is  a  program  for  arbitrary  N  and  M.   In 
neither  case  is  the  use  of  auxiliary  storage  systems  considered  in 
general,  although  an  example  of  the  use  of  ILLIAC  Disk  is  discussed. 

The  efficiency  of  ILLIAC  IV  in  the  convection  problem  is 
investigated  in  the  following  -ways: 

(a)  The  program  is  written  in  a  high  level  language  for 
ILLIAC  IV  (GLYPNIR)  and,  using  the  assembly  codes  generated  by  the 
GLYPNIR  compiler,  the  clock  time  is  estimated  for  a  mesh  size  of 

17  x  6k.      This  clock  time  is  compared  to  the  computer  time  required 
for  the  same  problem  on  the  CDC  GhOO   using  a  FORTRAN  code.   The  ratio 
is  found  to  be  approximately  1:1^0  for  the  first  case  and  1:130  for 
the  second  case. 

(b)  The  main  part  of  the  program  within  time  iterations 
is  written  in  the  ILLIAC  IV  assembly  language  (ASK)  and  inserted 
into  the  original  GLYPNIR-generated  codes.   This  partially  ASK-coded 
program  is  found  to  reduce  the  total  estimated  clock  time  by  about 
6Cffo.      The  execution  time  ratio  of  the  partially  ASK-coded  program 

to  the  CDC  6^00  FORTRAN  code  is  about  1:370  for  the  first  case  and 
1:300  for  the  second  case. 

(c)  An  analysis  of  the  utilization  of  the  processing 
elements  (PE's)  and  control  unit  (CU)  shows  that  the  PE's  of  ILLIAC 
IV  are  in  use  during  50%   of  the  time  in  both  of  the  GLYPNIR  programs. 
This  is  increased  to  about  Q5%   in  the  partially  ASK-coded  programs. 
One  factor  for  the  relative  speed  of  ILLIAC  IV  compared  to  serial 
machines,  in  addition  to  the  parallel  operations  in  PE's,  is  the 
large  overlap  (about  60%   in  the  partially  ASK-coded  programs)  in 

the  execution  of  CU  and  PE  instructions. 


(d)  An  analysis  of  the  utilization  of  PE's  and  CU  is 
further  made  to  investigate  the  percent  of  time  spent  to  execute  data 
access  instructions  which  are  not  directly  related  to  arithmetic 
calculations  such  as  address  calculation,  address  index  modification, 
route  and  mode  hit  setting  and  control  instructions . 

(e)  The  solution  for  a  non-core  contained  mesh  is  dis- 
cussed.  Definitive  statements  concerning  effective  use  of  ILLIAC 
Disk  are  not  possible  since  efficient  Disk  utilization  is  generally 
related  to  specific  mesh  sizes  and  choice  of  numerical  methods.   However , 
an  example  is  presented  in  order  to  indicate  the  considerations  necessary 
for  effectively  using  the  ILLIAC  Disk. 
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1 .   Introduction 

Partial  differential  equations  describing  fluid  motions 
of  interest  are  usually  non-linear  and  solutions  of  these  equations 
can  be  obtained  only  by  numerical  integrations,  except  for  a 
few  special  problems.   When  one  deals  with  finite  difference 
approximations  to  partial  differential  equations,  the  opportunities 
for  parallel  computation  are  omnipresent.   This  is  because  most 
numerical  integrations  of  partial  differential  equations  over  a 
domain  involve  a  large  grid  network  where  the  same  computation  is 
performed  at  each  grid  point,  excluding  the  boundary  points.   An 
application  of  parallel  processing  to  numerical  weather  prediction 
was  discussed  by  Carroll  and  Wetherald  (1967).   Recently,  Miranker 
(1970)  surveyed  parallelism  in  numerical  analysis.   The  ILLIAC  IV 
computer  consists  of  6k   processing  elements  (PE's)  which  can 
respond  simultaneously  to  an  instruction  stream  from  a  single 
control  unit  (CU).   It  therefore  provides  a  most  powerful 

tool  for  these  numerical  investigations  of  fluid  motions. 

The  present  work  was  done  to  partially  answer  the  question 
of  how  efficient  ILLIAC  IV  will  be  for  solving  hydrodynamic 
problems.   As  a  typical  hydrodynamic  problem  in  this  efficiency 
study,  we  have  chosen  a  two  dimensional  convection  problem.   The 
reasons  for  this  selection  are  (a)  the  basic  equations  for  this 
problem  govern  a  wide  variety  of  fluid  motions  of  hydrodynamical 
and  geophysical  interest  ,  (b)  the  governing  equations  include  those 
transport,  diffusion  and  elliptic  equations  which  appear  very 
frequently  in  hydrodynamic  and  mechanical  problems,  and 


This  set  of  equations  describes  convective  motion  induced 
by  the  buoyancy  force  and  associated  heat  transfer.   If  the  temperature 
variation  is  suppressed,  the  equations  governing  motions  in  an  incom- 
pressible fluid  are  recovered.   With  additional  equations  for  water 
substances  and  with  some  modifications,  the  equations  describe  various 
types  of  convective  motion  of  meterological  interest  such  as  thunder- 
storms and  squall  lines.   With  inclusion  of  an  equation  for  salinity, 
the  equations  describe  thermohaline  circulations  in  the  oceans. 


(c)  the  governing  equations  are  simple  enough  to  permit  detailed 

2 
investigation  of  parallelism  in  this  class  of  numerical  computation. 

Therefore,  the  usefulness  of  the  information  obtained  in  this  study- 
is  not  restricted  to  the  convection  problem. 

The  efficiency  of  a  parallel  computer  can  generally 
vary  from  less  than  1%   (when  only  one  processing  element  is  turned  on 
and  data  must  be  transferred  from  distant  processing  elements)  up 
to  100$.   Within  the  framework  of  the  present  organization  of  ILLIAC 
IV,  the  efficiency  depends  upon  many  factors.   Most  important  are: 
(a)  algorithms  for  solving  the  problem,  including  storage  schemes 
and  methods  of  handling  boundary  conditions  so  that  the  maximum 
efficiency  of  the  parallelism  can  be  obtained,  and  (b)   programming 
languages,  particularly  high  level  languages.   An  example 
of  required  efficiency  considerations  is  thst  because  the 
random  access  memory  of  ILLIAC  IV  is  distributed  among  6U  processing 
elements,  a  considerable  amount  of  routing  and  mode  bit  settings  is 
necessary.   The  time  required  for  data  access  instructions  compared 
to  that  for  arithmetic  instructions  is  therefore  important  in 
determining  efficiency  of  ILLIAC  IV  program  languages. 

In  this  paper,  we  solve  the  convection  problem  numerically 
as  an  initial-  and  boundary-value  problem,  using  well  established 
finite-difference  schemes.   Our  first  attention  is  paid  to 
storage  schemes  which  would  provide  high  efficiency  of  the  parallelism. 
The  program  is  then  written  in  GLYPNIR,  a  high  level  language  for 
ILLIAC  IV.   Using  the  assembly  codes  generated  by  the  GLYPNIR  compiler, 
we  estimate  clock  time  required  for  completion  of  the  numerical 
integrations  of  our  basic  equations.   We  also  solve  the  same  problem 
using  a  conventional  serial  computer,  and  the  required  clock  time  is 
compared  with  the  estimated  clock  time  for  ILLIAC  IV. 
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Although  previous  studies  have  been  performed  which  dis- 
cuss the  general  applicability  of  ILLIAC  IV  to  such  equations,  no 
detailed  analysis  of  the  type  performed  here  is  available. 


Next,  we  program  the  main  portion  of  the  program  in  the 
ILLIAC  IV  assembly  language,  ASK,  and  compare  it  with  the  assembly 
codes  generated  by  the  GLYPNIR  compiler.   This  comparison  provides 
useful  information  about  the  efficiency  of  GLYPNIR. 


h 

2 .   Description  of  the  Problem 

2. 1  Governing  Equations  for  Two-dimensional  Bernard-Ray leigh  Convection 

In  this  paper  we  shall  consider  a  simple  physical  situation 
where  a  thin  fluid  layer  confined  between  two  horizontal  plates  is 
heated  from  below  and  cooled  from  above  uniformly.   When  the  tempera- 
ture difference  between  the  bottom  and  top  plates  (or  more  generally 
the  Rayleigh  number)  exceeds  a  critical  value,  convective  motion  sets 
in. 

Using  the  coordinate  system  where  the  z-axis  is  pointing 
upward  vertically  and  the  x-axis  is  horizontal,  the  equations  for 
motion,  the  first  law  of  thermodynamics,  and  mass  continuity  are 
written  in  non-dimensional  form  as  (  see  Ogura  and  Yagihashi  (1971)  ) 


9u  ,    9u  ,    9u     3P    o  /  -,  s 

—  +u  —  +w  —  =-—  +V2u  1 

9t      9x     9z      9x 


■p 
9w    ,         9w  9w  9P  a  m    ,    „o  ,ON 
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3T  9T  9T        1      „9m  ,0x 

H  +  uil  +  w^  =  -v2T  (3) 

r 


9x        9z  K    ' 


where  u  and  w  are  velocity  components  along  the  x-  and  z-axis 
respectively,  t=time,  T=temperature  and  P=pressure. 


In  deriving  the  above  equations,  d,  d2/v,  AT,  v/d  and  p  v2/d 

are  used  as  scale  factors  for  length,  time,  temperature,  velocity  and 

pressure  respectively  where  d  =  the  separation  between  the  two 

horizontal  plates,  AT  =  the  temperature  difference  between  the  bottom 

and  top  plates,   p   =  constant  density  and  v  =  kinetic  viscosity. 

0 
The  non-dimensional  parameters  in  the  above  equations  are  defined  as 

R   (Rayleigh  number)  =   " — 
a  kv 


P   (Prandtl  number)  = 


v 


r  k 

where  g  =  acceleration  of  gravity,  a  =  coefficient  of  volume  expansion 
and  k  =  thermometric  diffusion. 

The  pressure  terms  in  equations  (l)  and  (2)  can  be  elimi- 
nated by  cross-differentiation  and  subtraction  to  derive  an  equation 
for  vorticity  (n)  in  the  direction  perpendicular  to  the  x-z  plane: 

3t     3x     3z   P   3x 

r 

It  is  convenient  to  introduce  the  stream  function  (\\>)   which  is  defined  by 

»-.  _  it ,  v  .  |t .  (6) 

3z       3x 

n  is  then  given  by 

3w    3u   „2  i  (7) 

3x    3z 

With  proper  initial  and  boundary  conditions,  the  set  of  equations  (3), 
(5)  and  (T)  determines  the  flow  system  completely. 

In  the  numerical  approach  used,  the  calculation  begins  with 
no  convective  motion  but  with  an  initial  uniform,  supercritical, 
vertical  temperature  gradient.   Motion  is  initiated  by  introducing  a 
random  perturbation  of  the  order  of  (1/100)  into  the  temperature  field. 


The  boundary  conditions  used  are: 
T  =  l,ijj  =  n  =  0     at  z  =  0 


T  =  ty   =   n  =  0 
f  =*  =  n  =  0 


at  z  =  1  , 

at  x  =  0  and  x  =  2.828U 


The  above  set  of  differential  equations  was  then  transformed 
into  finite-difference  forms   (see  Section  2.2)  and  a  numerical 
integration  was  performed  using  a  CDC  6^00. 

The  values  of  the  parameters  used  are : 

R  =  1315  =  twice  the  critical  Rayleigh  number, 
a 

P  =  0.713,  the  value  for  air, 


Ax  =  2.828U/n, 


Az  =  l/m, 


At  =  0.01, 


where  n  and  m  are  the  number  of  intervals  along  the  x-  and  z-axis 
respectively. 

Figure  1  shows  the  variations  with  time  of  temperature  and 
vorticity  at  a  grid  point  (i  =  2  and  k  =  8)  thus  calculated.  We  ob- 
serve that  the  motion  attains  an  approximate  steady  state  at  about 
300  time  steps.   Figure  2  shows  the  temperature  field  (solid  line  at 
0.1  intervals)  and  the  stream  function  field  (dashed  lines  at  unit 
intervals)  at  time  step  500. 

2.2.   Finite-difference  Scheme  used  for  Prognostic  Equations 

The  finite-difference  schemes  used  to  approximate  the 
thermodynamic  equation  (3)  and  the  vorticity  equation  (5)  are: 
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Figure  1 
Variations  with  Time  of  Temperature  and  Vorticity 
at  a  Grid  Point  (i  =  2  and  k  =  8) 


Figure  2 

Temperature  Field   (solid  lines   at  0.1  intervals!   anrt  th. 

Stream  Function  Field   (dashed  lines   at   1  inter^)^   Time   Step  500 


In  the  above  equations,  z=(Az)i,  x=(Ax)k  and  t=(At)r  with  i,  k, 
x=0,l,  2,  ...  .   The  above  finite-differencing  schemes  are 
essentially  the  same  as  those  used  by  Lilly  ( 196*0,  except  that  the 
diffusion  terms  are  evaluated  by  using  the  DuFort-Frankel  scheme. 
Without  violating  linear  computational  stability,  this  permits 
use  of  a  larger  time  increment  than  that  required  if  the  diffusion 
terms  were  evaluated  by  forward  time  differences.   Schemes  used 
here  preserve  the  total  energy  of  the  system  and  the  variance  of 
temperature  in  the  finite -difference  form  in  a  manner  analogous 
to  that  of  the  continuous  system. 

2.3-   Numerical  Method  Used  in  Solving  Poisson's  Equation 

Various  numerical  methods — iterative  and  direct — have 
been  used  to  solve  Poisson's  equation,  V2\p  -  r\.      Since  the 
successive  overrelaxation  (SOR)  method  can  be  used  in  any  shape  and 
size  of  domain  with  a  variety  of  boundary  conditions,  it  was  chosen  for 
this  study.   In  solving  the  Poisson  equation  on  the  (M+2)  x  (N+2) 
rectangular  domain  shown  in  Figure  3  on  serial  machines ,  the 
SOR  algorithm  improves  interior  components  in  order  of  i|>,  -i,^  o,  •  •  •  ■> 
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Figure  3.    ij)'s  on  Rectangular  Mesh 
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The  asymptotic  rate  of  convergence  of  SOR  is  independent 
of  the  order  in  which  the  components  are  improved.   Thus  an  ordering 
which  is  easily  adapted  to  ILLIAC  IV  is  used  (see  Ericksen  (1972)  ). 
The  interior  points  are  divided  into  two  classes,  "odd"  and  "even" 
where 

"odd"  =  {ty.     .        i  +  j  is  odd}  and  "even"  =  {\p.    .    i  +  j  is  even} 

All  points  in  the  "even"  ("odd")  class  are  improved  using  "boundary 

values  and  values  of  "odd"  ("even")  points.   In  this  algorithm,  all 

st 
"even"  points  in  the  (£+l)  '  iteration  are  improved  first  using 

-  yn.  .  +  (1-co)  *(£). 
Then  the  odd  points  are  improved  using 

-i.o +  <-^:]  <8) 

where  w  is  the  optimal  relaxation  factor,  and 

u>(AZ)2  n  (i)(A.X)2 


a  = 


and      Y  = 


2[(Ax)2  +  (Az)2]  2[(Ax)2  +  (Az)2) 

oi(Ax)2  (Az)2 


2[(Ax)2  +  (Az)2] 


As  seen  in  Figure  3,  the  PE's  which  contain  the  "even"  points  of  mesh 
row  I  contain  the  "odd"  points  of  row  1+1.   This  enables  us  to  improve 
the  "even"  points  of  row  I  and  1+1  simultaneously  by  using  a  PE 
integer  index  which  accesses  an  element  in  row  I  (i+l)  in  the  odd 
(even)  PE's  when  I  is  even.   In  a  like  manner,  all  the  odd  points 
in  two  consecutive  rows  can  be  accessed.   If  N  =  62  and  M  is  even, 
then  62  out  of  Q\   PE's  are  used  in  performing  most  of  the  calculations. 
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Since  80  to  90  percent  of  the  total  computational  time 
on  the  convection  problem  is  used  to  solve  Poisson's  equation  with 
SOR  on  both  serial  machines  and  on  ILLIAC  XV,  it  is  important  to 
choose  the  numerical  method  for  solving  Poisson's  equation  which  is 
most  efficient  for  the  given  domain  and  boundary  conditions.   A 
study  of  the  efficiency  of  ILLIAC  IV  in  solving  Poisson's  equation 
using  other  iterative  and  direct  methods  is  presented  in  Ericksen 
(1972). 
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3.   Data  Storage  Considerations 

One  of  the  difficult  tasks  in  organizing  an  ILLIAC  IV 
program  is  the  allocation  of  storage.  As  mentioned  in  Appendix  1, 
each  Processing  Element  Memory  (PEM)  is  associated  with  a  specific 
PE.   Therefore,  only  data  stored  in  PEi  can  be  brought  into 
registers  of  PEi  by  a  single  fetching  instruction.   If  6k   different 
data  are  to  be  placed  into  registers  of  all  6k   PE's  simultaneously, 
these  data  must  be  stored  in  the  PEM's  associated  with  the  appropriate 
PE's.   In  placing  data  stored  in  PEMi  into  a  PEj  register, 
one  must  first  bring  this  to  a  register  in  PEi  and  then  route  it 
by  j-i.   Similarly,  data  in  a  PEi  register  can  be  stored  into  PEMi 
by  a  single  store  instruction,  but  requires  routing  before  being 
stored  in  PEMj. 

The  memory  should  be  used  in  such  a  way  that  data  can  be 
regarded  as  contiguous  blocks  without  voids  in  order  not  to  waste 
space.   However,  the  data  should  also  be  stored  in  such  a  way  that 
the  algorithm  keeps  most  of  the  PE's  active  most  of  the  time. 
Frequent  reallocation  of  storage  should  be  avoided.  These  factors 
should  be  balanced  in  a  manner  producing  maximum  overall  speed. 
A  discussion  of  various  storage  schemes  and  their  relative  merits 
is  given  in  Appendix  3«   In  this  paper,  straight  storage  is  used 
throughout . 
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h .   Programming  Languages 

Since  ILLIAC  IV  has  an  organization  different  from  that 
of  conventional  serial  computers,  ILLIAC  IV  programs  must  be 
written  in  a  machine-dependent  language.   Two  programming  languages  for 
ILLIAC  IV  are  considered  in  this  paper.   One  is  the  assembly  language, 
ASK,  which  consists  of  CU  instructions  and  PE  instructions  (see 
Appendix  l).   Needless  to  say,  it  is  time  consuming  to  write  and 
debug  ASK  programs . 

Another  language,  GLYPNIR,  can  be  considered  as  a  high 
level  language  (see  Layman  et  al,  (1972)).  GLYPNIR  generates  the  necessary 
code  to  load  operands,  to  evaluate  expressions,  to  set  up  subroutine  calls, 
etc.   Relative  to  coding  in  assembly  language,  the  human  effort  in  writing 
a  GLYPNIR  program  is  comparable  to  that  required  in  writing  a  FORTRAN  or 
an  ALGOL  program. 

As  is  always  true  for  any  high  level  programming  language, 
assembly  codes  generated  by  the  GLYPNIR  compiler  are  not  as 
efficient  as  ASK  written  by  programmers.   However,  GLYPNIR  allows 
the  user  to  insert,  at  any  points  in  his  program,  specific  assembly 
language  statements  which  may  use  previously  declared  GLYPNIR 
variables  and  ADVAST  Data  Buffers  (ADB's). 

We  have  examined  the  efficiency  of  GLYPNIR  codes  in 
our  problem  in  the  following  way.  The  problem  was  programmed 
in  GLYPNIR  and  the  clock  time  was  estimated.   As  will  be  described 
in  Section  5.1,  almost  all  clock  time  needed  for  the  entire  program 
is  spent  by  execution  of  repeated  instruction  streams  within  time 
interations  (also  see  Section  7.1).   We  will  hereafter  refer  to 
these  instruction  streams  as  the  kernel  of  the  program. 

Because  of  this  fact,  we  reprogrammed  the  kernel  using 
ASK  and  inserted  it  into  the  original  program.   We  call  this  a 
partially  ASK-coded  program.   Since  the  kernel  consists  almost  en- 
tirely of  arithmetic  statements ,  it  is  not  as  hard  to  write  ASK 
code  for  this  portion  of  the  program  as  it  is  to  write  the  whole 
program  in  ASK.   As  will  be  described  in  more  detail  later, 
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the  total  time  taken  by  a  partially  ASK-coded  program  is  less  than 
one- half  the  total  time  taken  by  the  GLYPNIR  program.   Efficiency 
gained  in  ASK  coding  is  mainly  obtained  by  effective  address 
calculations,  storing  CU  constants  "which  are  used  frequently  in  the 
kernel  in  ADB's,  eliminating  data  storing  and  fetching  to  and 
from  PEM's  as  much  as  possible,  and  by  careful  programming  to 
maximize  the  overlap  between  PE,  PEM,  and  CU. 
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5 .   Program  Description 

5 .1  Mesh  Size  and  Data  Storage 

The  processing  element  memory  (PEM)  of  ILLIAC  IV  can  hold 
all  elements  in  a  row  (or  a  column)  in  PEO  through  63  if  the  number 
of  elements  is  not  greater  than  6k.      It  is  obvious  that  the  program 
for  this  case  is  simpler  and  needs  less  overhead  than  that  for  a 
mesh  whose  row  and  column  contain  more  than  6k   elements. 

First  we  have  written  a  program  for  the  mesh  size  (M+2)  x 
(N+2)  where  N  is  not  greater  than  62  and  M  is  less  than  250   so  that 
all  elements  in  a  row  of  this  mesh  can  be  stored  across  PE's  and  no 
auxiliary  storage  is  needed.   We  call  this  Program  I  in  this  paper. 
Program  II  was  then  written  for  an  arbitrary  mesh  size  constrained 
only  by  the  requirement  that  the  data,  along  with  program  instruc- 
tions, can  be  stored  in  PE  memory. 

We  used  the  straight  storage  scheme  because  the  boundary 
calculation  in  this  problem  is  much  less  complicated  than  the 
interior  calculation. 

5-2   Program  Flow 

Following  is  a  brief  description  of  a  program  flow: 

i.   Initialization-calculation  of  constants  and  mode 
patterns  used  in  the  rest  of  the  program.   Setting 
initial  temperature,  vorticity  and  stream  function 
distribution. 


3The  limit  of  250  rows  is  obtained  by  assuming  that  meshes 
n    ,  n      T    ,  T     ,  ty  and  two  temporary  storage  meshes  are 

contained  in  core.   The  two  temporary  storage  meshes  are  used  to  enable 

the  program  to  calculate  V2T^,  J(^T\   T^)  or  V2r/T\  J(i/T\  n    ) 
for  the  entire  mesh  and  store  them  in  temporary  storage  before 

T      or  n      is  calculated.   The  need  for  this  temporary  storage 
can  be  eliminated  by  calculating  T^T+   for  each  line  immediately  after 

calculating  v2T^T'  and  J(i//   ,  t't')  for  that  line—similarly  for1 

(t+1) 

n     .   This  would  increase  the  upper  limit  on  M  to  350. 
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ii .   Computation  of  the  distribution  of  temperature, 
vorticity  and  stream  function  at  time  step  one. 
(Since  the  leap  frog  method  is  used  in  time 
differencing,  the  calculation  of  data  at  time  step 
one  must  be  carried  out  separately). 

iii .   Computation  of  the  distributions  at  time  step  TAU  +  1 
using  data  at  time  step  TAU  -  1  and  TAU,  where 
1  ±     TAU  <_  TMAX.   This  portion  consists  of  the 
following  steps: 

(a)  Compute  vorticity  distribution  of  interior  points. 

(b)  Compute  temperature  distribution  at  interior  points. 

(c)  Compute  temperature  distribution  on  boundaries  along 

dT 
z-axis  according  to  -r—  =  0  at  x  =  0  and  x  =  2.8284. 

dx 

(d)  Compute  distribution  of  stream  function  by  solving 
the  Poisson   equation  V2ijj  =  n. 

(e)  Increase  time  step  by  1. 

(f)  If  this  is  a  time  to  print  out,  compute  horizontal 
and  vertical  velocity  and  print  out  T,  n ,  i|j,  u, 
and  w. 

(g)  If  time  step  reaches  to  TMAX,  terminate  execution: 
otherwise,  go  back  to  (a). 

It  is  apparent  that  steps  (a)  through  (e)  are  executed  TMAX  -  1 
times.   Therefore,  if  TMAX  is  reasonably  large,  most  of  the  clock 
time  is  taken  by  these  calculations.   PE  utilization  in  steps  (a),  (b) 
and  (d)  is  N/64  (for  N  <  62)  because  the  distribution  of  N  interior 
points  is  calculated  simultaneously.   Step  (c)  would  have  the  same 
PE  utilization  as  (a),  (b)  and  (d)  if  points  along  the  z-axis  were 
stored  across  PE's   or  if  skewed  storage  were  used.   In  step  (f)  the 
calculation  of  boundaries  of  horizontal  (vertical)  velocity  has  poor  PE 
utilization  if  we  store  the  boundary  points  x  =  0  and  x  =  2.8284 
(z  =0  and  z  =  1 )  in  PEO  and  PE(N+l)  (PEO  and  PE(M+l)).   Our  choice 
of  straight  storage  results  in  the  simultaneous  utilization  of  only  two 
PE's  in  steps  (c)  and  (f).   Since  the  calculation  of  horizontal  or  vertical 
velocities  (u  =  -  j£     and  w  =  -Ji)  is  simple  and  performed  only  when 
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print  out  is  necessary,  this  inefficiency  in  step  (f)  is  negligible. 
In  our  estimation  of  clock  time  for  this  problem  in  the  next  section, 
we  consider  only  the  clock  time  required  in  steps  (a)  through  (g), 
but  excluding  step  (f). 
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6.   Timing  Estimates 

6.1  Calculating  Execution  Time  on  ILLIAC  IV 

In  calculating  execution  time  of  an  instruction  stream  on 
ILLIAC  IV,  the  various  hardware  characteristics  of  this  machine  must 
"be  considered.   The  time  estimates  and  analysis  in  this  paper  are 
obtained  by  considering  the  following: 

(a)  The  time  taken  by  each  instruction  is  calculated 
according  to  Table  6-1  and  Table  6-2  in  the 
Instruction  Timing  Section  of  ILLIAC  TV  Systems 
Characteristics  and  Programming  Manual. 

(b)  All  FINST/PE  instructions  require  one  clock  time 
for  execution  in  ADVAST  or  two  clock  times  if  ACAR 
indexing  is  required  or  a  CU  operand  is  used. 

(c)  The  FINST/PE  instruction  which  requires  a  second 
operand  fetched  from  PEM  requires  an  additional 
seven  clock  time . 

(d)  There  is  parallelism  between  ADVAST  and  FIUST  when 
they  are  operating  on  different  instructions  in  a 

nearby  portion  of  the  instruction  stream.   The 
execution  of  instructions  on  two  relatively  independent 
stations,  ADVAST  and  FINST,  is  overlapped.   An  in- 
struction queue,  FINQ,  between  these  two  stations  holds 
up  to  8  instructions  to  smooth  the  flow  of  instructions 
through  both  stations. 

(e)  There  is  parallelism  between  noninterfering  portions 

of  successive  FLNST/PE   instructions.    This  overlap 
is  provided  by  two  execution  stations  within  FINST. 
Each  station  contains  a  register  which  receives  the 
next  instruction  from  the  instruction  queue,  FLNQ,  and 
examines  it  when  it  is  time  to  start  the  instruction 
execution.   The  actual  execution  of  the  instruction  is 
carried  out  from  a  second  register.   As  a  result  of 
this  hardware  characteristic,  memory  fetches  which  occur 
at  the  beginning  of  instructions  are  overlapped  with  the 
preceding  FINST/PE  instruction.  As  soon  as  the  PE 
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finishes  executing  the  store  instruction,  it  is  ready 
for  the  next  instruction;  at  the  same  time,  PEM  is 
busy  for  seven  clocks  while  storing  the  data  into  an 
appropriate  location, 
(f)   Concerning  the  execution  time  for  LOAD  and  SETC  in- 
struction, it  is  seen  that  a  LOAD  instruction  with  an 
address  other  than  ADB  causes  the  FINST  queue  to  be 
emptied  and  that  SETC  causes  ADVAST  to  stop  processing 
instructions  from  IWS  until  FINQ  is  empty. 

Programs  I  and  II,  both  in  GLYPNIR,  were  compiled  on  the 
GLYPNIR  compiler  on  the  B67OO  and  the  ASK  codes  were  generated.   As 
described  before,  the  partially  ASK-coded  programs  were  obtained 
by  inserting  the  ASK  codes  for  the  kernels  into  the  programs 
generated  by  GLYPNIR  compiler.   Timing  charts,  an  example  of  which  is 
shown  in  Appendix  2,  were  prepared.   These  charts  show  when,  how  and 
how  much  time  is  spent  in  CU,  PE  and  PEM.   The  total  instruction  time 
required  for  time  integrations  of  the  problem  on  a  grid  mesh 
(M+2)  x  (N+2)  was  computed. 

The  estimated  number  of  ILLIAC  clocks  required  in  Program  I 
per  time  step  is 

T10  ♦  6001,  ♦.  (^)  !t  (M.+S)  **  the  g™ AS*)  ^ 

coded  solution  of  Poisson's  equation,  and 

coded  solution  of  the  prognostic  equations— l~t  is  the  number  of 

required  iterations  in  SOR,  M'  is  the  number  of  rows  of  PEM  required 

,  M  /evenl 

to  store  a  (M+2)  x  (N+2)  mesh,  and  6  =  l-J  for  M1  =  I  odd  J  . 

The  estimated  number  of  ILLIAC  clocks  required  in  Program  II 
per  time  step  is 

900  +  9l.0It  +  (^)  It  (M-  +.)forthe(™ASK)       (ID 
coded  solution  of  Poisson's  equation,  and 

2000  ♦  (^0)  M.  for  the  («™  ^       (12) 

coded  solution  of  the  prognostic  equations. 
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6.2  Comparison  of  Timing  with  CDC  61+00 

In  order  to  compare  the  execution  time  on  ILLIAC  IV  with 
a  conventional  sequential  computer,  the  same  problem  was  programmed 
in  CDC  6U00  FORTRAN  and  run  for  50  time  steps.   If  it  can  be 
assumed  that  the  efficiency  of  CDC  62+00  FORTRAN  and  GLYPNIR  compilers 
are  comparable,  then  timing  comparisons  can  be  considered  as  indica- 
tive of  relative  machine  efficiency  on  this  type  of  problem.   The 

mesh  size  used  is  N  =  62  and  M  =  15  and  the  average  number  of  SOR 
iterations  required  per  time  step  is  21.   We  assume  that  ILLIAC  IV 

runs  at  16  MHz. 

It  was  found  that  the  CDC  61+00  requires  6k   seconds  to  solve 
the  Poisson  equation  by  SOR  and  13  seconds  for  the  rest  of  the 
calculation  (mainly  solving  two  prognostic  equations).   On  the  other 
hand,  0.1+8  second  and  0.062  second  are  taken  on  ILLIAC  IV  by  the 
GLYPNIR  Program  I  for  the  same  calculations.   The  GLYPNIR  to  FORTRAN 
ratio  is  about  1:130  for  the  SOR  solution  of  Poisson' s  equation  and 
1:210  for  the  prognostic  equations.   As  a  whole,  the  execution  time 
ratio  of  ILLIAC  IV  GLYPNIR  Program  I  to  CDC  61+00  FORTRAN  is  l:lU0. 
If  the  problem  were  amenable  to  32-bit  precision,  an  ASK  program 
could  be  written  in  32-bit  mode.   We  estimate  that  the  problem  in 
32-bit  mode  could  be  solved  in  60-70%   of  the  time  taken  in  61+-bit 
mode--  a  brief  discussion  is  given  in  Appendix  h. 

Table  1  shows  the  further  comparison  of  the  computation 
time  taken  by  the  CDC  6^00  FORTRAN  program,  GLYPNIR  Programs  I  and  II 
and  partially  ASK-coded  Programs  I  and  II  for  the  IT  x  6k   mesh. 
Because  this  comparison  is  made  for  the  mesh  which  has  6k   points 
along  the  x-axis,  no  PEM  is  wasted  in  storing  the  data  and  62  out 
of  6k   PE's  are  busy  calculating  new  distributions.   The  time  re- 
quired on  a  serial  machine  is  reduced  if  N  decreases.   However,  if 
the  same  storage  scheme  is  used,  the  same  amount  of  time  is  re- 
quired on  ILLIAC  IV  for  1  <_  N  <_  62 .   Therefore,  when  implementing 
problems  on  ILLIAC  IV,  care  must  be  taken  to  choose  an  appropriate 
mesh  size  and  storage  scheme  (e.g.,  if  N  <_  30,  two  rows  can  be 
stored  across  a  PE). 
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Table  1  indicates  that  Program  II  takes  £0%   more  time  than 
Program  I  when  comparing  GLYPNIR  codes  and  ^©^  more  time  when  comparing 
partial -ASK  codes  for  the  prognostic  equations.   This  compares  to 
a  difference  of  $%   for  GLYPNIR  and  19 #  in  partial -ASK  for  Poisson's 
equation.   The  reason  for  these  differences  is  that  Program  II 
requires  additional  PE  indexing  in  the  prognostic  equations  whereas 
in  Poisson's  equation  the  need  for  PE  indexing  is  nearly  the  same  in 
both  Programs  I  and  II  due  to  the  SOR  algorithm  employed.   PE  indexing 
in  GLYPNIR  is  discussed  in  Section  7.1. 

As  the  mesh  size  or  the  number  of  iterations  required  for  con- 
vergence in  SOR  increase,  equations  (9-12)  indicate  that  the  difference 
between  GLYPNIR  Programs  I  and  II  and  the  difference  between  partial-ASK 
Programs  I  and  II  for  Poisson's  equation  both  become  negligible.   The 
ratio  of  GLYPNIK  to  partial-ASK  execution  times  approaches  3:1  for  SOR. 
For  the  prognostic  equations,  the  ratio  of  GLYPNIR  to  partial-ASK  execution 
times  approaches  2.2:1  as  the  mesh  size  increases. 

Since  few  programmers  ever  claim  that  their  code  is  the  ultimate  in 
efficiency,  we  need  hardly  apologize  for  the  fact  that  the  GLYPNIR  and  ASK 
codes  written  for  this  study  may  be  capable  of  improvement.   However,  it  may 
be  useful  to  the  reader  to  know  our  feelings  as  to  how  well  the  codes  were 
written.   We  believe  that  the  partially-ASK  coded  program  are  close  to 
optimal  (capable  of  5-10%  improvement)  while  the  GLYPNIR  codes  might  be 
speeded  up  by  15-25%  by  a  programmer  intimately  familiar  with  the  GLYPNIR 
compiler. 
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Table  1.  Comparison  Between  Computer  Time  Required      \, 
with  CDC  6400  and  ILLIAC  IV  (N  -  62  and  M  =  15  for  all  cases) 


Partially 

Partially 

CDC   61+00 

GLYPNIR 

GLYPNIR 

ASK-coded 

ASK-coded 

FORTRAN 

Program  I 

Program  II 

Program  I 

Program  II 

Poisson' s 

130 

1 

1-01+ 

0.38 

0.1+3 

Equation 

Prognostic 

210 

1 

1.36 

0.1+7 

O.67 

Equations 

Total 

lUO 

1 

1.08 

0.39 

0.1+5 

In  order  to  compare  to  other  computers,  we  list  relative  throughput  with 
respect  to  the  CDC  61+00  (taken  from  NCAR  computing  facility  estimates): 


Throughput 

w.r.t. 

Computer 

CDC   61+00 

IBM  360/91 

6.0 

IBM  360/75 

1.5 

IBM   360/65 

0.75 

CDC   76OO 

12.5 

CDC   6600 

2.5 

CDC   61+00 

1.0 

DEC   PDP-10 

0.5 

UNIVAC    1108 

1.25 
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7-   Analysis  of  Kernel 

7.1  CU  and.  PE  Utilization 

Whereas  in  the  previous  section  the  overall  time  required 
to  solve  this  problem  on  ILLIAC  IV  was  discussed,  here  an  analysis 
of  the  kernel  is  given  to  determine  how  time  is  spent ,  effectively 
or  ineffectively,  in  calculation  on  ILLIAC  IV.   The  kernel  of  this 
program  consists  of  six  separate  arithmetic  statements  contained 
within  loop  statements   such  as:   LOOP  K:  =  1,  1,  M  DO  <  arithmetic 
statement  i  >;  for  i  =  1,  ...,6. 

Instruction  streams  executed,  repeatedly  M  times  in  each 
loop  statement  are  called  DIFF,  ADW,  ADVT,  PRV,  PRT  and  POISS5. 
DIFF  calculates  diffusive  terms,  V2T  and  V2n,  in  equations  (3)  and 
(5).   In  ADVT  and  ADW,  the  advective  terms   j(ip,T)  and  J ( ^ , n )  of  the 
heat  equation  and  the  vorticity  equation  are  calculated  according  to 
(3')  and  (5')  respectively.   Given  advective  and  diffusive  terms, 
PRV  and  PRT  compute  the  distribution  of  vorticity  and  temperature 
for  time  step  TAU.  +  l  according  to  (3')  and  (5")»   Equations  (3') 
and  (5")  may  be  written  in  terms  of  these  quantities  as  follows: 

t(t+1)  _  1  [c*  t(t^1)  +  2Atf-  ADVT(i,i,k) 
l  ,k     c      l  ,k        V 

+  |-  DIFF(T,T,i,k))  ],  (3"*) 

r 


nU;l}  =  I  [f-  n!T;l}  +  2At  (  -  ADW(r,i,k) 

1  ,  K-        I  1,K  V 


R   T(t)     -   T(l)  -» 

-*       i+l,k      i-l,k  +  DIFF(n,T,i,kn  ],    (5") 

P         2Az 
r 


BU)  ,    +  b(t)  ,    „(t)  +   b(t) 


,  _,   .   s      i+l,k    i-l,k  i  ,k+l    i,k-l  , 

where     DIFF(  B,T  ,i  ,k)  =   * *-  +  * » ' 

(A.z^2  (Ax)2 
ADVT(x,i,k)  S  J-  v^(T)>  t(0)   > 

1  ,K 


and 


ADW(t,i,k)  =j.   (^(t),  n(r))   . 


i,k 


5<Arithmetic  statement  i>  contained  in  each  loop  statement 
is  shown  in  Appendix  5 • 
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PRT  combines  the  terms  on  the  right  hand  side  of  equation  (3  ")   while 
PRV  combines  terms  in  equation  (5"").     Using  the  successive 
overrelaxation  (SOR)  method,  POISS  improves  all  elements  of  ty   in 
I,  iterations.   Timing  estimates  are  given  for  a  single  iteration. 

Table  2a,  2b,  2c  and  2d  show  the  percent  of  the  total 
time  in  which  the  PE's  and  the  CU  are  active  in  GLYPNIR  Programs  I 
and  II  and  in  partially  ASK-coded  Programs   I  and  II,  respectively. 
Since  all  actual  arithmetic  calculations  are  carried  out  in  PE's, 
it  is  desirable  to  have  the  processing  elements  busy  100%  of  the 
time.   However,  this  is  not  the  case.   A  PE  becomes  idle  when  the 
PE  must  wait  for  the  CU  to  send  FINST/PE  instructions  or  when  it  must 
wait  for  operands  being  fetched  from  PEM.   The  PE  idle  time  caused 
by  the  former  will  be  called  CU  bound  and  that  caused  by  the  latter 
PEM  bound.   Tables  2a  -  2d  contain  information  about  the  percentages 
of  the  PE  idle  time  due  to  CU  bound  and  PE  bound.   The  CU  becomes 
idle  when  it  cannot  send  any  FIUST/PE  instructions  into  FINQ  because 
FINQ  is  full,  or  when  the  CU  is  held  up  to  wait  for  FINQ  to  be 
emptied  for  the  execution  of  instructions  such  as  LOAD,  SETC,  etc. 
On  the  average,  a  PE  is  busy  50%  {k%)   of  the  time  in  GLYPNIR  Program 
I  (II)  and  87%  (77%)  in  the  partially  ASK-coded  Program  I  (il). 

Equations  (9-12)  are  seen  to  be  consistent  with  the  Total 
Clock  Time  per  row  in  Table  2  if  one  recalls  that  DIFF  is  used  twice 
in  calculating  the  prognostic  equations--once  in  (3")  and  once  in  (5"). 
The  column  labeled  "Poisson  Per  Iteration"  contains  the  number  of  clocks 
in  each  category  required  in  each  iteration  for  the  17  x  6k   mesh  (21  iter- 
ations were  required  for  this  mesh  size  to  meet  the  convergence  criteria). 

Since  SOR  is  an  iterative  method,  POISS  contains  code  to 
determine  when  the  method  has  converged  for  all  mesh  points.   This 
convergence  test  comprises  26%  of  the  GLYPNIR  POISS  Program  I  and  \% 
of  the  partially  ASK-coded  POISS  Program  I.   If  one  point  in  the  mesh 
fails  to  fall  within  the  error  tolerance,  then  another  iteration  is  re- 
quired.  Once  such  a  point  has  been  found,  testing  of  the  remaining  mesh 
points  is  not  needed.   An  IF  statement  is  used  to  test  if  such  a  point  has 
been  found  and,  if  so,  skips  the  tests  for  the  remaining  mesh  points.   This 
type  of  IF  statement  can  be  performed  in  the  CU  if  the  code  is  written  in 
ASK.   With  a  full  FINQ  at  the  beginning  of  the  IF  statement,  it  could  be 


Table  2.  Utilization  of  Total  Clocktlme 


Table  2a.   Utilization  of  Total  Clocktlme  by  GLYPNIR  Program  I. 


25 


DIFF 

ADW 

ADVT 

PRV 

PRT 

Poisson  Per 

Average 

Per  Row 

Per  Row 

Per  Row 

Per  Row 

Per  Row 

Iteration 

Active  FE  Time 

65   (1*7%) 

117  (51*) 

191   ( 5"*%) 

101   (52*) 

6lt 

52% 

187   ("*>*%) 

■ 

PF  is  PKM  Bound 

72   (53*) 

111  (1*9%) 

160   (1*6*) 

89   !>*5*) 

60 

1*8%) 

176   (1*2*) 

(U6*) 

PE  is  CU  Bound 

C   (  0*) 

0  (  0*) 

0   (  0*) 

6   (  3*) 

0 

0%) 

59   (l1**) 

(  1**) 

Active  >'EM  Time 

91   (66*) 

133  (58*) 

182   (52*) 

105   ( 5**%) 

70 

56%) 

218   (52*) 

(55%) 

Active  CU  Time 

95   (69*) 

123  (5U*j 

177   (50*) 

12"*   (63*) 

86 

69%) 

2<*5   (58%) 

(58%) 

Total  lime  (clocks) 

137 

228 

351 

196 

12U 

1*22 

Table  2b.   Utilization  of  Total  Clocktime  by  GLYPNIR  Program  II. 


DIFF 

ADW 

ADVT 

PRV 

PRT 

Poiss 

on  Per 

Average 

Per  Row 

Per 

Row 

Per 

Row 

Per  Row 

Per  Row 

Itera 

tion 

Active  PE  Time 

76 

50%) 

181* 

(50%) 

271 

(>*9*) 

111   (53*) 

61* 

(52*) 

187 

(1*1*%) 

Ff  Is  PEM  Bound 

77 

50*) 

187 

(50%) 

260 

(1*7%) 

91 

60 

(1.8%) 

176 

(1*2%) 

(1*6%) 

Pr  is  C  :  nound 

0 

0*) 

0 

(  0%) 

21 

(  "*%) 

i    (  It*) 

0 

59 

(ll*%) 

(  5%) 

Active  PEN  1 ime 

98 

6i*%) 

22U 

(6o*) 

308 

(56%) 

112   (53*) 

70 

(56%) 

218 

(52*) 

Active  CU  Time 

101 

66%) 

178 

(W*) 

265 

(1*8%) 

129   (61*) 

86 

(69%) 

2l*5 

(58%) 

(55*) 

Total  Time  ( clocks  1 

153 

371 

552 

210 

121* 

1*22 

Table  2c. 

Utilization  of  Total  C 

locktime  by  ASK 

Program  1 

DIFF 

ADW 

ADVT 

PRV 

PRT 

Poisson  Per 

Per  f 

DW 

Per  Row 

Per  Row 

Per  Row 

Per  Row 

Interation 

Average 

■ ive  PE  Time 

(95%) 

91 

■ 

11*0   (96%) 

78   (92%) 

1*9    (78%) 

107   (77%) 

(87%) 

PF  is  PEM  Bound 

3 

(  54) 

11* 

13%) 

(  1**) 

7   (  8%) 

1U 

10   (  7%) 

(  9%) 

PF  is  CU  Bound 

0 

(  (A) 

0 

0%) 

0   (  0%) 

0     0 

0    (  0%) 

22   (16%) 

(  U«) 

Active  PEM  Time 

28 

(1*7%) 

59%) 

8U   (58%) 

1*9   (58*) 

1*2    (f7*) 

70   (50*) 

(56%) 

Active  CU  Time 

51 

(88*) 

53 

50%) 

77   (53%) 

K)   (71%) 

38    (60%) 

109   (78%) 

(65%) 

Total  Time  (clocks) 

..0 

107 

1.1*6 

85 

63 

139 

Table  2d.   Utilization  of  Total  Clocktlme  by  ASK  Progr 


DIFF 

ADW 

ADVT 

PSV 

PRT 

Poisson  Per 

Per  Row 

Per  Row 

Per  Row 

Per 

<ow 

Per  Row 

Interation 

Average 

active  PF  litre 

5i    (71%) 

113 

190   (77%) 

75 

1*9 

78%) 

108   (77%) 

(77*) 

PE  is  PFM  Bound 

20 

35    (2U%) 

58   (23%) 

17 

(18%) 

11* 

22%) 

10   (  7%) 

(20%) 

PI  is  CU  Bound 

0   (  0%) 

0   (  0%) 

0   (  0%) 

0 

(  0%) 

0 

0%) 

22    (  11  '  ) 

(  3%) 

Active  PEM  Time 

1*9   (i  ".) 

112   (76*) 

175   (71%) 

70 

i  ;■  -  i 

1*2 

67%) 

77   (55%) 

(68%) 

Active  cu  Time 

(72%) 

66   (1*5%) 

88   (35%) 

65 

(71%) 

38 

60%) 

110   (79%) 

(55%) 

Total  Time  (clocks) 

78 

1U3 

21*8 

9? 

63 

1U0 
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completely  masked  by  the  PE.   In  GLYPNIR,  part  of  all  IF  statements  are 
performed  in  the  PE's.  Communication  from  the  PE  to  the  CU  is  performed  by 
SETC,  an  instruction  which  requires  FINQ,  to  be  drained.   This  is  the  primary 
reason  for  the  execution  time  ratios  between  GLYPNIR  and  ASK  being  higher 
for  POISS  than  other  subroutines.   (The  time  required  for  convergence  test- 
ing could  be  reduced  by  testing  at  every  G   iteration  or  by  not  testing  at 
all  until  after  the  L   iteration.   Familiarity  with  the  convergence 
properties  of  a  particular  set  of  parameters  allows  wise  choices  of  G  or  L. ) 

Comparison  between  Table  2a  and  2c  indicates  that  the  PEM 
bound  is  reduced  considerably  in  the  partially  ASK-coded  program. 
The  reduction  of  PEM  bound  is  due  to  the  increased  overlap  of  PE 
and  PEM  (see  Table  k)    and  less  frequent  memory  fetch.   It  is  also 
observed  that  the  average  percentage  of  PEM  bound  in  Table  2d  is 
fairly  large  compared  to  that  in  Table  2c.   This  is  due  to  the 
additional  PE  indexing  required  in  Program  II  for  handling  the  general 
mesh  size.   The  extra  indexing  for  Program  II  also  increases  the 
time  for  some  of  the  GLYPNIR  subroutines,  but  it  doesn't  show  up  in  the 

percentage  of  PEM  bound  (see  Tables  2a  and  2b). 

The  increase  in  the  total  clock  time  for  Program  II  is 
greater  in  ADVT  and  ADW  than  in  the  other  subroutines.   ADVT  and 
ADW  require  calculations  such  as 

Tl  =  RTL(1,M0DE,(PSI[KKPL]  -  PSI[KKMI])*ETA[KK]) 

-  RTR(1,M0DE,  (PSI[KKPL]  -  PSl[KKMl] )*ETA[KK] ) ; 
for  Program  I,  while  Program  II  uses 

Tl  =  RTL(l, MODE, ( PSI [KKPL+L]  -  PSI  [kKMI+L] )*ETA[KK+L] ) 

-  RTR  ( 1, MODE, ( PSI [KKPL+R]  -  PSl[KKMI+R] )*ETA[KK+R] ) ; 
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The  code  for  Program  I  can  be  simplified  as  follows: 

T^  =  (PSI[KKPL]  -  PSI[KKMI])*ETA[KK]; 
Tl  =  RTL(1,M0DE,TU)  -  RTR(l  ,MODE,Tl* ) ; 

Program  II  cannot  use  this  improvement  because  of  the  PE  indexes  L 
and  R.   However,  the  code  for  Program  II  could  be  improved  in  a 
similar  manner  by  using  extra  logic  and  a  considerable  amount  of 
temporary  storage.   This  is  not  done. 

PREAL  vectors  which  are  passed  to  subroutines  in  GLYPNIR 
are  accessed  by  PCPOINTERS  (a  form  of  indirect  addressing).   Since 
the  PCPOINTERS  are  stored  in  PEM,  the  subroutine  must  access  PEM 
one  more  time  than  otherwise  required  when  a  line  of  the  vector 
is  read.   In  Table  2a,  the  active  PEM  time  for  DIFF  is  more  than  three 
times  larger  than  the  corresponding  value  in  Table  2c.   This  is 
due  to  the  elimination  of  PCPOINTERS  in  performing  the  indirect 
addressing  in  the  ASK  code.   The  ASK  code  uses  CU  values  to  do  the 
indirect  addressing  required  to  pass  vectors  to  and  from  subroutines. 

It  should  be  noted  that  available  information  on  PE  and  PEM 
overlap  is  incomplete,  thus  necessitating  some  assumptions.   We  have 
assumed  maximum  overlap  between  the  processing  and  memory  fetching  of 
any  two  consecutive  instructions.   In  the  worse  case,  no  overlap 
could  lead  to  an  increase  in  our  ASK  timing  estimates  of  about  50$ 
and  an  increase  in  our  GLYPNIR  timing  estimates  of  about  10$.  (in 
Table  2,  the  Active  PEM  Time  estimates  are  the  upper  limits  of  PEM 
Bound). 
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7.2  Analysis  of  CU  and  PE  Utilization 

In  this  section,  we  shall  classify  all  instruction  streams 
in  the  program  into  two  categories  depending  upon  whether  or  not 
instructions  perform  operations  directly  related  to  the  arithmetic 
calculation.   Instructions  in  the  first  category  are  those  which 
perform  arithmetic  calculations,  store  data  and  fetch  data  to  and 
from  given  locations.   PE  instructions  in  the  first  category  are 
grouped  under  the  heading  Arithmetic  Instructions.   CU  instructions 
in  the  first  category  labeled  CU  Operand  Fetching  consist  of  those 
CU  operations  which  fetch  operands  for  arithmetic  calculations  in 
the  PE.   Instructions  in  the  second  category  perform  one  of  the 
following  four  different  functions: 

(a)  Address  calculation  and  address  index  modification: 
This  includes  a  calculation  of  PEM  location  of  the 
data  to  be  fetched  or  stored.   The  CU  and  PE  integers 
used  for  indexing  PE  variables  within  a  loop  state- 
ment, such  as  JI  in  PSl[Jl],  must  be  modified  each 
time  whenever  this  loop  is  executed. 

(b)  PE  instruction  decoding:   The  CU  requires  one  clock 
to  decode  each  PE  instruction.   Thus,  this  time  also 
represents  the  number  of  PE  instructions  which  are 
executed. 

(c)  Route  of  data  and  mode  bit  setting:   These  are 
characteristic  functions  on  ILLIAC  IV.   They  are 
RTL,  LDEE1,  SETE,  SETE1 ,  etc. 

(d)  Control  instructions:   An  example  is  an  instruction 
stream  which  tests  and  determines  if  a  loop  has  been 
executed  a  prescribed  number  of  times,  or  determines 
whether  certain  conditions  are  met  or  not. 
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We  have  computed  the  percentage  of  time  used  by  these  different 
types  of  instructions  of  the  PE's  and  CU.   These  are  reported  in  Tables  3 
and  k.      Table  k   also  includes  the  percentage  of  time  during  which  PE  and 
PEM  are  overlapped.   Increased  PE  and  PEM  overlap  can  be  obtained  by 
conscientious  instruction  ordering;  e.g.,  rather  than  writing  ASK  code 
"LDA  A;  MLRN  B;  STA  C;  LDS  Q",  one  can  write  "LDA  A;  MLP.N  B;  LDS  E;  STA  C;  "— 
this  permits  the  content  of  E  to  be  fetched  from  PEM  while  the  PE  is  executing 
a  multiplication. 
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Table  3.  CU  Utilization 
Table  3a.  CU  Utilization  by  GLYPNIR  Program  I. 


DIJT 

ADW 

AJVT 

PRV 

PRT 

Poisson  Per 

Average 

Per 

How 

Par 

Row 

Per 

Row 

Per 

Row 

Per  Row 

Iteration 

Address  and   Index 
Calculation 

38 

(UO*) 

U6 

(37*) 

70 

(UO*) 

32 

(26*) 

2U 

(28*) 

50          (20*) 

<30*) 

Route  and  Mode  Bit 
Setting 

0 

(    0*) 

0 

(  o*) 

0 

(  o*) 

12 

(10*) 

0 

(    0*) 

12          (    5*) 

(    3*) 

Control  Instructions 

22 

(23*) 

22 

(18*) 

22 

(12*) 

22 

(17*) 

22 

(26*) 

75         (31*) 

(22*) 

PE  Instruction  Decoding 

27 

(29*) 

51 

(U2*) 

81 

("•6*) 

U6 

(37*) 

28 

(32*) 

96         (39*) 

(39*) 

CU  Arithmetic  Operand 
Fetching 

8 

(  8*) 

U 

(   3*) 

U 

(    2*) 

12 

(10*) 

12 

aw 

12         (   5*) 

(  6*) 

Total  Active  CU  Time 
(clocks) 

95 

123 

177 

12U 

86 

2l»5 

Table  3b.   CU  Utilization  by  GLYPNIR  Program  II. 


DIFF 

ADW 

ADVT 

PRV 

PRT 

Poisson  Per 

Average 

Per 

Row 

Per  Row 

Par  Row 

Per  Row 

Per 

Row 

Itera 

.ion 

Address  and   Index 
Calculation 

38 

(37%) 

58       (3U*) 

78        (29*) 

32          (25*) 

2U 

(28*) 

50 

(20*) 

(28*) 

Route   and  Mode  Bit 
Setting 

0 

(   0%) 

0        (    0*) 

0          (    0*) 

12          (    9*) 

0 

(    0*) 

12 

(    5*) 

(    3*) 

Control  Instructions 

22 

(22*) 

22        f13*) 

58          (22*) 

22          (17*) 

22 

(26*) 

75 

(31*) 

(22*) 

PE  Instruction  Decoding 

33 

(33%) 

88       (51%) 

125         (U7*) 

51         (UO*) 

28 

(32*) 

96 

(39*) 

(US*) 

CU  Arithmetic  Operand 
Fetching 

8 

(  m 

U       (  2*) 

U           (    2*) 

12         (   9*) 

12 

au*> 

12 

(   5*) 

(   5*) 

Total  Active  CU  Time 
(clocks) 

101 

172 

265 

129 

86 

21*5 

Table  3c.   CU  Utilization  by  ASK  Program  I 


DIFF 
Per  Row 

ADW 
Per  Row 

ADVT 
Per  Row 

PRV 
Per  Row 

PRT 
Per   Row 

Poisson   Per 
Interation 

Average 

Address  and   Index 
Calculation 

22 

(U2*) 

17     (32*) 

28       (36*) 

13 

(22*) 

6       (15*) 

30       (27*) 

(30*) 

Route  and  Modebit  Setting 

0 

(   0*) 

0      (   0*) 

0 

6 

(10*) 

0        (   0*) 

6       (  5*) 

(    3*) 

Control  Instructions 

8 

(15*) 

8     (15*) 

8       (11*) 

8 

(13*) 

8        (21*) 

28       (26*) 

(18*) 

PE  Instruction  decoding 

15 

(28*) 

2U     (1*5*) 

37       (U8*) 

21 

(35*) 

12        (32*) 

29        (27*) 

(35*) 

CU  Arithmetic  Operand 
Fetching 

8 

(15*) 

It     (  8*) 

U       (  5*) 

12 

(20*) 

12        (32*) 

16       (15*) 

(1U*) 

Total  Active  CU  Time 
(clockfl) 

53 

53 

77 

60 

38 

109 

Table  3d.   CU  Utilization  by  ASK  Program  II 


DIFF 
Per  Row 

ADW 
Per  Row 

ADVT 
Per  Row 

PRV 
Per  Row 

PRT 
Per   Row 

Poisson  Per 

Interation 

Average 

Address   and  Index 
Calculation 

23 

(Ul*) 

22 

(33*) 

20 

(23*) 

11* 

(22*) 

6 

(15*) 

30        (27*) 

(27*) 

Route  and  Modebit  Setting 

0 

(  o*) 

0 

(  0*) 

0 

(  o*) 

9 

(11**) 

0 

(  0*) 

6        (    5*) 

(    ■**) 

Control  Instructions 

8 

aw 

8 

(12*) 

8 

(   9*) 

8 

(12*) 

8 

(21*) 

28        (26*) 

(lfi«) 

PE  Instruction  decoding 

17 

(31*) 

32 

(1*9*) 

56 

(65*) 

22 

(31**) 

12 

(32*) 

30       (27*) 

(UO*) 

CU  Arithmetic   Operand 
Fetching 

8 

(II**) 

it 

(  6*) 

h 

(  1**) 

12 

(18*) 

12 

(32*) 

16       (15*) 

(13*) 

Total  Active  CU  Time 
(clocks) 

56 

66 

88 

65 

38 

110 

31 


Table  U.   PE  Utilization 
Table  l*a.   PE  Utilization  by  CLYPNIR  Program  I 


til! 

F 

a:v/v 

Per  Pow 

ADVT 
Per   Row 

PRV 
Per  Row 

PRT 

Per    Pow 

Poisson   Per 

Interation 

Average 

ess   and    Index 
'al  -illation 

10 

(16*) 

Ut 

(32*) 

26       (lit*) 

l-        (16*) 

12 

(19*) 

1*3       (26* 

(17*) 

Route  and  Modebit  Setting 

B 

(12*) 

17 

(15*) 

21*        ( 12*  ) 

10       (10*) 

0 

(   0*) 

10       (   '-'1 

(10*.) 

'ontrol    Instructions 

0 

(  o*) 

0 

(  at) 

0        (    0*) 

0        (   01) 

0 

(  o*) 

38       (201) 

(   5'?.) 

Ar  i  thmet  i      Instructions 

'.7 

(72*) 

Bi 

(7??.) 

Ul       (7«t*) 

75       (75*) 

52 

(81*) 

91 

» 

PF  and   PEM  overlap 

19 

H) 

(19*) 

2?        (12*) 

\i        (16%) 

10 

(16*.) 

1*2        (2?*) 

rotal  Active  PF.  Time 
■ .  i  :ki 

- 

117 

191 

101 

-.4 

187 

at-le  Ifb. 

PE  Utilization 

by  GLYPNIR  Program  II 

DIFF 
F-er   Pow 

ADW 
Per   Row 

HDVT 

Per    Pow 

PPV 
Per      ow 

PRT 
Per   E 

ow 

Poisson   Per 
Interation 

Average 

address    >"1    l»iei 
:     i  lotion 

'0 

■ 

58 

(31*) 

89       (33*) 

(21**) 

12 

(19*) 
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8.   Use  of  ILLIAC  Disk  for  Non-core  Contained  Meshes 

When  the  problem  being  run  becomes  too  large  to  contain  in 
core,  the  I/O  time  becomes  an  important  consideration.   First  the 
I/O  system  for  ILLIAC  IV  will  be  discussed  and  then  its  efficiency 
for  our  specific  problem  -will  be  examined. 

8.1  ILLIAC  I/O  System 

The  ILLIAC  IV  Disk  is  a  15,600  K  word  memory.   This 
memory  is  divided  into  52  bands.   Each  band  is  divided  into  300  pages, 
each  containing  l6  lines  of  PEM  (1028  words).  Memory  transfers  be- 
tween the  ILLIAC  IV  Disk  and  PEM  are  restricted  to  pages.   A  data 
request  can  read  or  write  up  to  128  consecutive  pages  on  one  band. 
If  the  data  is  not  consecutive  or  if  it  is  spread  over  more  than 
one  band  a  separate  request  is  needed  for  each  string  of  consecutive 
pages.   The  time  required  for  the  Disk  to  prepare  to  perform  a  data 
request  is  equivalent  to  the  time  to  transfer  two  pages  of  data  between 
the  Disk  and  PEM.   Thus  if  a  data  request  reads  or  writes  on  page  i 
and  band  j ,  the  next  data  request  should  skip  at  least  two  pages  and 
start  at  page  i  +  3  of  band  k  where  j  need  not  equal  k.   If  the 
second  data  request  wanted  page  i  +  1 ,  or  i  +  2 ,  the  request  would 
have  to  wait  for  a  revolution  of  Disk  before  it  could  be  executed. 
The  transfer  rate  between  PEM  and  Disk  is  133  usee  per  page  and  the 
disk  revolves  once  every  i+0  msec.   One  revolution  of  the  disk  is 
equivalent  to  6140,000  ILLIAC  IV  clocks. 

8.2  1/0  for  the  Bernard-Rayleigh  Convection  Problem 

Now  let  us  examine  the  Bernard-Rayleigh  convection  problem 

6  (t) 

where  each  mesh  is  196  x  256.    Five  meshes  are  required:   T    , 

T     ,  n    ,  n      and  ¥    •   Each  mesh  will  be  divided  into  pages 

containing  four  mesh  rows,  each  of  which  is  divided  into  four  PE  rows. 


5This  size  was  chosen  to  illustrate  the  use  of  ILLIAC  Disk 
when  ¥  andn  are  completely  core  contained.   If  this  mesh  size  is  ex- 
ceeded, the  solution  of  SOP  becomes  highly  1/0  bound.   However,  if  the 
mesh  size  becomes  this  large  it  is  best  to  go  to  direct  methods  for 
solving  Poisson's  equation.   Direct  methods  require  only  one  core  con- 
tained mesh  for  optimal  efficiency,  thereby  allowing  a  larger  mesh. 
Examples  of  1/0  considerations  in  using  direct  methods  are  given  in 
Erickson  (1972). 
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Each  time   step  of  the  convection  process   has  three  main 
parts:      calculate   n  ,    calculate  T^T  and  calculate   V 

n<T+1)  is  calculated  using  „£>    ,,  n^.   n^.  ^      ,     and 

¥«?]    ....  T?T+1)        is  calculated  using  t[TJ    .  ,t!T?., 

l+l,  J+1  1,3  &     i±l,J        l,J+l 

(t-1)        (t)  (t)  (t)  (t) 

T.V    ,      ,    V;','    IA_  ,    ¥:^;    .    and  V.    '      .      Through  use  of  all   n        , 
I, J  l+l,  J+1'      l+l,  j  i,j+l  & 

H7.     .        is   calculated.      The  process  will  be   divided  into  two  parts. 

First   n     +1     and  T(x+1')  will  be   calculated.      Then  y'T+1)  will  be 

calculated. 

Tp  and  T  will  be   calculated  one  page  at  a  time. 

When  page   I  of  T|  and  t'T+1^   is   being  computed,    page  1-1  of  t'T+   ', 

(r)  (t) 

Tv  '   and  Tp  '  will  be  written  on  Disk.   Also  during  the  calculation, 

(t)        (t-1)  Ct-1) 

page  1+2  of  Tv  ',  Tv    '   and  T\ v    '  are  read  from  Disk  into  a  12  page 

(t)  (t) 

storage  queue,  S,  in  PEM.  Tp   and  ^  '   are  completely  contained  in  core, 

When  page  I  of  T^T+  '   is  being  calculated,  page  1-2  of  Tp     overwrites 

(t) 
page  1-2  of  TV  ' . 


In  51  steps,  the  data  is  read  from  Disk  to  PEM.   A  com- 
plete time  iteration  for  the  prognostic  equations  is  calculated  and 
the  data  is  written  on  Disk.   Steps  1-1+8  read  information,  steps 
3-50  calculate  r\  and  T     ,  and  steps  U-51  write  the  appropriate 

information  on  Disk.   Figure  h     shows  what  pages  are  read,  calculated, 

(x-l) 
or  written  at  each  time  step.   Figure  5  shows  the  Diskmap  of  T     , 

T  '   and  n    '■  R  (W  )  refers  to  the  I   read  (write)  performed.   C 

th 
refers  to  the  I   group  of  pages  upon  which  calculations  are  performed. 

(t)  (t+1)         (t+1) 

At  this  point  the  n      mesh  contains  n    •  All  of  n 

and  T^T+1^  have  been  calculated,   n   ^  ,  I  '       and  T^T+1^have  been 

written  on  Disk  to  be  used  in  the  next  time  .iteration.   As  seen  in 

Figure  5  ,  each  read  or  write  requires  five  pages  of  Disk — three 

pages  for  data  and  two  pages  which  are  passed  while  the  data  request 

is  interpretted.   Thus,  the  I/O  for  a  page  of  the  prognostic  equations 

takes  at  least  1330  ysec,  the  time  needed  to  pass  ten  pages  on  the  Disk. 
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Subscript  J  on  i]  and  T  refer  to  the  J   page  of  the  mesh 
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ASK  code  can  solve  the  prognostic  equations  at  a  rate  of  70?  ysec 
per  page.   Thus,  both  the  calculations  and  the  I/O  for  one  page  of 
the  prognostic  equations  could  he  performed  in  1330  ysecs.   The 
problem  is  then  i/o  bound  i+7%  of  the  time. 

A  GLYPNIR  program  can  solve  the  prognostic  equations  at  a 
rate  of  1563  ysecs  per  page.   By  passing  12  pages  instead  of  ten 
pages  for  each  page  of  prognostic  calculations,  the  I/O  and  the 
calculations  for  one  page  of  the  prognostic  equations  could  be 
performed  in  1596  ysec.   This  problem  is  I/O  bound  2%   of  the  time. 

By  writing  DIFF  and  PRV  in  ASK  and  ADVT ,  ADW  and  PRT 
in  GLYPNIR,  a  page  of  the  prognostic  equations  could  be  calculated 
in  1295  ysec.   This  GLYPNIR/ASK  Program  saves  the  programming  cost 
of  writing  ADVT,  ADW  and  PRT  in  efficient  ASK  without  increasing  the 
computer  time  required  for  execution. 

Using  a  queue  instead  of  full  meshes  requires  extra  code 
to  keep  the  indeces cycling  properly.   Although  a  thorough  study 
has  not  been  performed  on  the  queue  indexing, it  is  estimated  that 
it  will  take  little,  if  any,  additional  computer  time.   This  is 
because  the  queue  indexing  code  consists  of  CU  instructions,  most 
of  which  can  be  overlapped  with  the  PE  and  PEM  fetches  already 
needed. 

If  the  GLYPNTR/ASK  Program  or  the  partially  ASK-coded  Program  is 
used,  the  prognostic  equations  will  be  advanced  one  time  step  in  the  amount 
of  time  for  the  Disk  to  revolve  once  (300  pages)  and  pass  195  pages  on  the 
next  revolution.   Due  to  the  fact  that  data  is  written  25  pages  after  it  is 
read  (see  Figure  5),  the  Disk  will  be  lU5  pages  from  the  point  where  the 

calculation  for  the  prognostic  equations  could  be  begun  for  the  next 

(t+1)  (t) 

time  step.   Now  ^      is  calculated  using  SOR  with  n    as  the 

(t)  7 

source  term  and  \\j  as  the  initial  guess  .   Each  iteration  for  the 

ASK  code  takes  the  amount  of  time  required  for  the  Disk  to  pass  51  pages. 


7t    (t)      (t) 
If  n    and  \p         were  not  core  contained,  it  would  require 

1^0  ysec  per  page  to  update  ty   using  partially  ASK-coded  SOR,  whereas  it  would 
require  at  least  532  jusec  to  perform  the  necessary  I/O. 
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Thus ,  a  time  step  which  solves  the  prognostic  equations  and  performs 

N  iterations  of  SOE  can  be  performed  in  the  time  required  for  1+2 

revolutions  of  Disk  and  ^0  pages  passed  on  the  1+3  revolution  where 

(for  N>2) 

T  .   |31N  -  11+5 1 

I    300    |   • 

As  N  increases,  the  percent  of  i/O  bound  decreases  but  not  monotonically . 
The  percent  of  the  time  during  which  the  partially  ASK-coded  Poisson  equation 
is  i/O  bound  is  given  by 

PI0=     100(1-°- scorer' 

Figure  6  contains  graphs  of  percent  i/O  bound  as  a  function 

of  the  number  of  iterations  in  SOE  for  the  Poisson  equation  and  the 

entire  convection  problem. 
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9 •   Cone luding  Remarks 


It  is  hoped  that  this  paper  will  be  useful  to  two  classes  of 
readers.   First,  the  fluid  dynamicist  should  find  this  study  useful  in 
deciding  whether  it  is  worthwhile  to  develop  codes  for  ILLIAC  IV.   Second, 
ILLIAC  IV  language  developers  should  find  the  reported  statistics  useful  in 
determining  priorities  for  developing  efficient  high  level  languages. 

As  demonstrated  in  the  preceding  sections,  ILLIAC  IV  is  well 
suited  for  solving  the  class  of  hydrodynamic  problems  considered  in  this 
paper.   However,  care  must  be  taken  in  choosing  an  appropriate  mesh  size 
and  storage  scheme. 

Needless  to  say,  the  best  efficiency  can  be  achieved  by  writing 
programs  in  ASK.   Only  the  programmer  can  scan  the  whole  problem  and  write 
the  program  which  maximizes  PE,  PEM  and  CU  overlap.   However,  GLYPNIR  is 
much  easier  to  use  in  developing  codes.   Over  the  past  year,  GLYPNIR 
efficiency  has  significantly  improved.   It  is  felt  that  further  improvement 
can  still  be  accomplished  through  the  use  of  ACAR  indexing,  implementation  of 
CU  IF  statements,  and  an  improved  method  of  passing  vectors  to  and  from  sub- 
routines. 


ko 
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Characteristics  of  ILLIAC  IV 


Appendix  1 
8 


The  five  major  elements  of  the  ILLIAC  IV  are  the  Control 
Unit  (CU),  the  Processing  Unit   (PU) ,  the  Input-Output  (i/O)  subsystem, 
The  Disk  File,  and  the  B6T00  Control  Computer.   Each  PU  is  a  combina- 
tion of  a  Processing  Element  (PE),  Memory  Logic  Unit  (MLU),  and 
Processing  Element  Memory  (PEM).   A  CU  directly  governs  6h   PU's  con- 
figured in  an  array  shown  in  Figure  7 • 

The  Control  Unit  is  that  portion  of  the  computer  system  which 
performs  the  initial  processing  of  instructions.   Instructions  are 
fed  in  turn  from  the  instruction  stack  to  the  Advanced  Station  (ADVAST), 
which  is  the  principal  housekeeper  of  the  system  where  such 
functions  as  address  arithmetic,  loop  control,  mode  control,  inter- 
rup  processing,  and  configuration  control  are  performed.   ADVAST 
permits  all  those  functions  generally  associated  with  program 
control  to  be  performed  concurrently  with,  but  in  advance  of  and 
separate  from,  the  main  processing  activity. 

The  primary  function  of  the  Final  Station  (FINST)  is  to  act 
as  an  intermediary  between  the  main  control,  in  ADVAST,  and  the  6h 
array  elements,  called  Processing  Elements  (PE's).   The  repertoire 
for  ILLIAC  IV  has  two  general  categories  of  instructions,  those 
executed  as  ADVAST  and  those  executed  in  the  array  elements.   Those 
instructions  intended  for  execution  in  PE's  are  transferred  from 
ADVAST  to  FINST  where  they  are  decoded  and  then  broadcast  to  6h 
array  elements  (FINST  also  broadcasts  full-word  operands,  shift 
counts,  test  values,  and  other  common  array  parameters). 

The  array  element,  or  PE,  is  the  execution  portion  of  the 
configuration;  with  the  exception  of  mode  and  some  data-dependent 
conditions,  this  unit  is  devoid  of  all  independent  control.   Mode  control 
permits  a  PE  to  accept  or  ignore  a  broadcast  control  sequence  from 
the  CU,  dependent  on  current  status  of  mode  bits.   The  PE  is  essentially 


See  Denenberg  (l9Tl)  and  "Illiac  IV  System  Characteristics 
for  more  details . 
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a  four-register  arithmetic  unit,  capable  of  executing  a  full 
repertoire  of  instructions  having  64-bit,  32-bit,  or  8-bit 
operands.   Further,  operations  involving  64-bit  and  32-bit  words 
can  be  done  in  either  fixed-point  or  floating-point  representa- 
tion. 

The  array  memory  consists  of  64  independent  memory  modules 
called  Processing  Element  Memories  (PEM's).   Each  PEM  is  collocated 
with  and  assigned  to  a  specific  PE,  providing  storage  for  2048  words 
of  64  bits  each.   An  address  adder  and  an  index  register  within  the 
PE  permit  memory  indexing  and  addressing  independent  of  FINST  control 
Such  independence  provides  important  flexibility  for  addressing 
data  stored  in  a  variety  of  ordered  forms. 

It  is  apparent  that  ILLIAC  IV  may  be  suited  to  the 
execution  of  algorithms  defined  on  arrays  of  data,  e.g.,  matrix 
operations  and  certain  partial  differential  equations.   However, 
there  are  two  major  difficulties  in  using  such  a  machine.   One 
is  that  an  algorithm  must  be  devised  which  allows  identical 
computations  to  be  performed  simultaneously  on  a  large  number  of 
operands.   The  other  is  closely  related  to  the  first:   the  data 
must  be  so  placed  in  the  ILLIAC  IV  memory  so  that,  over  the  course 
of  the  calculation,  few  PE's  are  disabled. 
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Appendix  3 
Storage  Schemes 

3.1  Straight  Storage  for  6k   or  less  Columns 

This  scheme  stores  the  row  vector  a.  _,  a.  _,  ,  a.  _,  ... 

1,0   i,l   l  ,2 

across  PE's  starting  at  PEO,  that  is,  a.    is  stored  in  location  A  +  i 
of  PEj  .  (see  Figure  3).  This  scheme  is  adaptable  to  parallel  accessing 
of  rows;  however,  if  parallel  accessing  of  columns  is  often  necessary, 
this  scheme  will  not  suffice.   Since  all  "boundary  points  a   ,  a  _,  .. 
a^   Q  and  a  ^   ,  .  . . ,  a^     are  stored  in  PEO  and  PE[(N+l)  mod  6k] 
respectively,  boundary  calculations  require  disabling  of  all  PE's 
but  those  which  contain  boundary  points.   If  an  array  or  a  partial 
block  of  an  array  has  rows  of  dimension  less  than  6k,   then  inefficient 
operation  will  also  result  in  their  rows. 

3.2  Skew  Storage  for  6k   or  less  Columns 

This  storage  allocation,  distributing  columns  as  well  as 

rows  across  the  PE's,  will  allow  access  of  an  entire  row  or  column 

in  parallel.   In  this  scheme  the  row  vector  a.  n.    a.  .  ,  a.  „,  ...   is 

i,0'      i,l'      i,2' 

stored  across   PE's  with  a.     .    in  location  A+i   of  PE[(i+j)   mod  6k]. 

PEO  PE1  PEj 

loc  A  a  a     ..        ....  a      .  ■    .    . 

0,0  0,1  0,3 

Loc  A+l         an    ^_  a,    _       .    •    .  a_    .   .        •    .    . 

l,o3  1,0  ljj-l 


Loc  A+i         a.)(0_i)mod6i;  a.j(l_i)inod6i+  <    >    _  ft^    (j-i)mod6U.. 


Loc  A+M 
Loc  A+M+l 


aM,    (0-M)mod64  ^,(1   -M)mod64.    .    .  aM,  (j-M)mod64 

aM,    (0-M-l)mod64  aM,(l-M-l)mod64   •••        ^  (j-M-l)mod64 
Figure    8.     Skew  Storage 
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Skew  storage  has  proved  useful  in  matrix  operations  in 
Gaussian  elimination  with  column  pivoting,  and  in  alternating 
direction  implicit  methods  and  explicit  partial  differential 
equation  calculations  where  boundaries  must  be  treated  by  a  fairly 
complicated  algorithm  which  is  quite  different  from  the  algorithm 
operating  on  interior  points. 

3- 3  Odd- Even  Storage  for  128  or  less  Columns 

This  storage  scheme  uses  two  adjacent  lines  of  PEM  to  store 
each  row.   The  rows  appear  in  the  same  relative  order  as  in  straight 
storage:  that  is,  this  scheme  stores  a.  _.  .-at  PEj  in  location  2i+  i " 
where  j'is  0  or  1.   If  the  number  of  elements  in  a  row  is  less  than 
or  equal  to  6k,    straight  storage  or  skew  storage  requires  half 
as  much  PEM  as  required  by  an  odd-even  storage-   But  if  that  number 
is  between  65  and  128,  straight  storage  and  odd-even  storage  require 
the  same  number  of  rows  in  PEM. 

Odd-even  storage  has  been  used  in  applications  of  the 
Successive  Line  Overrelaxation  method  and  partial  differential 
equations  which  require  fetching  a  number  of  neighboring  points 
into  the  register    (see  Ericksen  (1972)). 
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Figure    9«     Odd -Even  Storage 
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3.k     Skewed  Odd-Even  Storage   for  128  or  less    Columns 

This  stores  a_ .  ,  .  ^  ,-,.,„.-  at  PE[(i+j)  mod  6k]   in  location 
2 l+i  ,2j+j 

A+2(2i+i  ')+j  "  t   where  i-"  and  j  '  are  0  or  1.   This  has  advantages  and 
disadvantages  similar  to  those  of  odd-even  storage;  in  addition,  this 
scheme  permits  accessing  the  column  simultaneously. 

Additional  storage  schemes  such  as  vn   -  skew  storage  and 
modified  skew  storage  for  the  QR-algorithm,  triangular  arrays, 
triangular  1-skew  storage  for  symmetric  matrix,  and  others  have 
been  devised.   However,  the  nature  of  the  problem  considered  here 
only  requires  consideration  of  the  four  alternate  storage  schemes 
described  above. 

3.5   Storage  Scheme  Suitable  for  this  Type  of  Problem 

In  the  computation  of  new  vorticity  and  temperature 

distributions ,  it  is  necessary  to  use  old  values  on  eight  neighboring 

points.   Consider  fetching  these  neighboring  points  of  a.  .,  that  is 

i  >0 

a .    _     .      ,    a  ,   a.     .    n ,    and  a.         .to  the  PE  where  a.     .is 

1+1,3-1'      1+1,  j+1'      i,o+l'  1+1,0  1,0 

stored  and  where  calculation  for  the  new  value  of  a.  .is 

i,0 
performed. 

It  is  obvious  that  nine  fetches  are  required  regardless  of 
the  storage  scheme,   but  the  number  of  routings  required  to  place 
data  into  the  register  of  an  appropriate  PE  depends  on  the  storage 
scheme.   It  can  be  seen  that  six  routings  by  distance  one  are 
required  for  straight  storage,  three  by  distance  one  for  odd-even 
storage,  and  four  by  distance  one  and  two  by  distance  two  for  the 
skew  storage  (see  Figure  10 ).   Finally,  four  routings  by  distance  one 
are  required  for  skewed  odd-even  storage  when  i+o  is  odd,  while  four 
routings  by  distance  one  and  one  routing  by  distance  two  are  re- 
quired for  skewed  odd-even  storage  when  i+o  is  even.   Therefore, 
where  calculations  of  interior  points  are  concerned,  straight  storage 
and  odd-even  storage  have  advantages  over  skew  storage.   On  the  other 
hand,  skew  storage  has  advantages  in  computing  boundary  values.   Odd- 
even  storage  has  an  advantage  over  straight  storage  where  6k   <    (N+2)  mod  128, 
However,  programs  using  this  scheme  become   complicated  and  requires 
more  time  for  controlling  the  instruction  stream.  The  best  storage 


hi 

scheme  for  the  problem  at  hand  must  he  chosen  only  after  considering 
the  amount  of  boundary  calculations,  the  nature  of  difference  schemes 
used,  the  size  of  mesh,  etc. 

Besides  choosing  a  storage  scheme,  the  side  of  the  rectangular 
mesh  that  should  be  stored  across  PE's  must  be  determined  considering 
the  memory  utilization,  PE  efficiency  and  the  amount  of  overhead  due 
to  the  control  statements.   For  the  problem  addressed  in  this  paper, 
straight  storage  was  used  throughout . 
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Appendix  4 

Considerations  in  the  use  of  the  32-bit  Mode 

ILLIAC  IV  allows  the  programmer  to  use  64-bit  or  32-bit 
words.   GLYPNIR  allows  only  the  64-bit  mode,  but  if  the  program  is 
written  in  ASK  then  the  32-bit  mode  can  be  used.   The  32-bit  mode 
effectively  doubles  core  per  PE  (1*096  32-bit  words)  and  usually 
requires  less  memory  fetches.   64-bit  operations  require  more  than 
half  the  time  of  the  corresponding  double  32-bit  operation,  e.g., 
ADRN  takes  7  clocks  to  add  two  64-bit  numbers  in  64-bit  mode  but  it 
requires  10  clocks  to  perform  the  two  sums  of  two  32-bit  pairs  in 
32-bit  mode.   Although  we  have  not  coded  the  convection  problem  in 
32-bit  mode,  it  is  estimated  that  a  convection  problem  with  M  x  128N 
meshes  could  be  solved  in  60-70%  of  the  time  required  in  the  64-bit 
mode. 

32-bit  mode  presents  two  problems  which  require  solutions 
unique  to  32-bit  mode.   Routing  is  the  first  problem.   Bernhard 
(1972)  presented  the  solution  to  this  problem.   An  N  element  vector 
will  be  stored  with  elements  1  to  N/2  in  the  outerwords  and 
elements  N/2+1  to  N  in  the  innerwords .   On  a  four  PE  machine,  a 
16  element  vector  would  be  stored  as  follows: 

PEO         PE1         PE2        PE3 
A+0      r±   ap      a2  a1Q      a3  a^      a^  &12 

A+l      a5  a13     a6  a^      a?  a^      ag  a±6 

With  this  storage  scheme,  the  appropriate  values  can  be  obtained  in 
all  the  PE's  with  the  same  type  of  routing  as  in  the  64-bit  mode 
except  when  PEO  is  required  to  access  the  element  to  the  right  of 
a  ,  i.e.,  a^.   A  route  of  one  to  the  right  places  aft  in  the  outer 
word  of  register  R  in  PEO.   A  "SWAP"  (the  inner  and  outer  words  are 
exchanged)  must  take  place  to  put  a„  in  the  inner  word  of  register  R 
of  PEO. 
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Accessing  both  rows  and  columns  is  the  second  problem. 
Bernhard's  (1972)  storage  scheme  is  fine  for  accessing  128  columns 
simultaneously  but  the  storage  must  be  changed  to  32-bit  skewed 
storage  to  access  rows  simultaneously.  An  8  by  8  mesh  in  32-bit 
skew  storage  on  a  four  PE  machine  would  look  like 

A+0      a^  a5a  a1>2  a^2   a^  a^   a.±  ^   a^ 

A+1     al,5  a5,5  al,6  a5,6   ai,T  a5,7   al,8  a5,8 
A+2      a2^  a6^   a^  a^   a^  ag     a^  a^ 


A+3 
A+U 


^2,8  %8  d2,5  6,5   d2,6  d6,6   d2,7  6,7 
a3,3  a7,3  a3,U  al,k       a3,l  a7,l   a3,2  a7,2 


A+5     a3,7  a7,7  a3,8  a7,8   a3,5  a7,5   a3,6  aT,6 


A+6 


ak,2   a8,2  ah,3  a8,3   aU,U  a8,U  %,i   a8,l 


A+T     ali,6  a8,6  %,7  a8,7   %,8  a8,8   aU,5  a8,5 


Figure  11.   32-bit  Skew  Storage 


Note  that  32-bit  skew   storage  enables  the  access  of  128  rows  or 
columns  simultaneously  on  ILLIAC  IV  and  routes  work  without  swaps. 
If  one  wanted  to  calculate  a.  .  -  lA(a.  -,  >-?   +  a.  ,   )  for  either 

!>J  1-1  J        1+1)J 

i=4  or  i=5 ,   swaps  would  have  to  be  performed  in  the  above  example 
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5. 1      Convection  Program  I 
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Table  i   indicates  that   Prcfcram  II 


takes 


28%  more  time  than 


Program  I  when   comparing  GLYPIOP  cocjes   and  k0%  more  time  vhen   comparing 
partial -ASK  eodedj   for  the  prognostic]  equations.      This    compares   to 
a   difference   of   %   for   GLYFmii   ^    ig%   ±n  partial_ASK  fQr  Poisson^ 
equation.      The  reason   for  these   differences   is   that  Program  II 
requires   additional  PE  indexing  in   the  prognostic   equations  whereas 
in  Poisson's   equation  the   need  for  PE  indexing  is   nearly  the   same   in 
both  Programs   I   and  II  due  to  the   SOP  algorithm  employed.      PE  indexing 
in  GLYPMTR  is   discussed  in  Section  7.1. 


